In [ ]:
%matplotlib inline

import json
import codecs

Topic 2: Collecting Social Media Data

This notebook contains examples for using web-based APIs (Application Programming Interfaces) to download data from social media platforms. Our examples will include:

  • Reddit
  • Facebook
  • Twitter

For most services, we need to register with the platform in order to use their API. Instructions for the registration processes are outlined in each specific section below.

We will use APIs because they can be much faster than manually copying and pasting data from the website, they provide uniform methods for accessing resources (e.g., searching for keywords, places, or dates), and using them conforms to each platform's terms of service (important for partnerships and publications). Note, however, that each of these platforms imposes strict limits on access: e.g., requests per hour, search history depth, maximum number of items returned per request, and so on.
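
As a concrete illustration of working within these limits, the sketch below shows a generic polling loop that pauses between requests. The fetch_page function and the 60-second delay are placeholders for illustration only, not part of any specific platform's API.


In [ ]:
import time

def polite_fetch(fetch_page, num_pages, delay_seconds=60):
    """A minimal sketch: call a hypothetical fetch_page(page_number)
    function, sleeping between requests to stay under a rate limit."""
    results = []
    for page in range(num_pages):
        results.extend(fetch_page(page))

        # Pause between requests; tune this to the platform's documented limits
        time.sleep(delay_seconds)
    return results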


Topic 2.1: Reddit API

Reddit's API used to be the easiest to use, since it did not require credentials to access data on its subreddit pages. Unfortunately, this has changed, and developers now need to create a Reddit application on Reddit's app page located here: https://www.reddit.com/prefs/apps/


In [ ]:
# For our first piece of code, we need to import the package 
# that connects to Reddit. PRAW is a thin wrapper around Reddit's 
# web APIs and works well

import praw

Creating a Reddit Application

Go to https://www.reddit.com/prefs/apps/. Scroll down to "create application", select "web app", and provide a name, description, and URL (which can be anything).

After you press "create app", you will be redirected to a new page with information about your application. Copy the unique identifiers below "web app" and beside "secret". These are your client_id and client_secret values, which you need below.
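
Rather than pasting credentials directly into a notebook, one option is to keep them in a separate JSON file and load them at runtime. The sketch below assumes a hypothetical reddit_credentials.json file with "client_id" and "client_secret" keys that you populate from your own app page.


In [ ]:
# A minimal sketch for loading Reddit credentials from a local JSON file
# (the filename and key names are placeholders, not required by PRAW)
with codecs.open("reddit_credentials.json", "r", "utf8") as credFile:
    creds = json.load(credFile)

# creds["client_id"] and creds["client_secret"] can then be passed to praw.Reddit()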


In [ ]:
# Now we specify a "unique" user agent for our code
# This is primarily for identification purposes, and user agents
# associated with abusive behavior may be blocked
redditApi = praw.Reddit(client_id='OdpBKZ1utVJw8Q',
                        client_secret='KH5zzauulUBG45W-XYeAS5a2EdA',
                        user_agent='crisis_informatics_v01')

Capturing Reddit Posts

Now for a given subreddit, we can get the newest posts to that sub. Post titles are generally short, so you could treat them as something similar to a tweet.


In [ ]:
subreddit = "worldnews"

targetSub = redditApi.subreddit(subreddit)

submissions = targetSub.new(limit=10)
for post in submissions:
    print(post.title)
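
Each submission object carries more than just its title. The sketch below prints a few additional fields (score, number of comments, and creation time, which PRAW exposes as standard Submission attributes); adjust the fields to whatever your analysis needs.


In [ ]:
import datetime

# A sketch: print a few more fields for each new post
for post in redditApi.subreddit("worldnews").new(limit=5):
    created = datetime.datetime.utcfromtimestamp(post.created_utc)
    print(post.score, post.num_comments, created, post.title)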

Leveraging Reddit's Voting

Getting the new posts gives us the most up-to-date information. You can also get the "hot" posts, "top" posts, etc., which should be of higher quality. In theory. Caveat emptor.


In [ ]:
subreddit = "worldnews"

targetSub = redditApi.subreddit(subreddit)

submissions = targetSub.hot(limit=5)
for post in submissions:
    print(post.title)
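
PRAW also exposes the "top" listing, which can be restricted to a time window. A short sketch, assuming your PRAW version supports the time_filter argument (with values like "day", "week", or "month"):


In [ ]:
# A sketch: top posts from the past week (time_filter support may vary by PRAW version)
for post in redditApi.subreddit("worldnews").top(time_filter="week", limit=5):
    print(post.score, post.title)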

Following Multiple Subreddits

Reddit has a mechanism called "multireddits" that essentially allows you to view multiple subreddits together as though they were one. To do this, you concatenate your subreddits of interest using the "+" sign.


In [ ]:
subreddit = "worldnews+aww"

targetSub = redditApi.subreddit(subreddit)
submissions = targetSub.new(limit=10)
for post in submissions:
    print(post.title)

Accessing Reddit Comments

While you're never supposed to read the comments, for certain live streams or new and rising posts, the comments may provide useful insight into events on the ground or people's sentiment. New posts may not have comments yet though.

Comments are attached to the post title, so for a given submission, you can pull its comments directly.

Note that Reddit returns comments in pages to prevent server overload, so you will not get all comments at once and will have to write additional code to fetch more than those returned at first. This pagination is represented by MoreComments objects embedded in the comment tree.


In [ ]:
subreddit = "worldnews"

breadthCommentCount = 5

targetSub = redditApi.subreddit(subreddit)
submissions = targetSub.hot(limit=1)
for post in submissions:
    print(post.title)

    # Limit how many comments PRAW fetches for this submission
    post.comment_limit = breadthCommentCount

    # Get the top-level comments
    for comment in post.comments:
        if isinstance(comment, praw.models.MoreComments):
            continue

        print("---", comment.name, "---")
        print("\t", comment.body)

        # And the replies to each top-level comment
        for reply in comment.replies:
            if isinstance(reply, praw.models.MoreComments):
                continue

            print("\t", "---", reply.name, "---")
            print("\t\t", reply.body)
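
If you want the full comment tree rather than skipping MoreComments objects, recent PRAW versions provide a replace_more() method that expands (or, with limit=0, simply drops) those stubs before you iterate. A minimal sketch:


In [ ]:
# A sketch: expand/drop MoreComments stubs, then walk the flattened comment tree
for post in redditApi.subreddit("worldnews").hot(limit=1):
    post.comments.replace_more(limit=0)   # limit=0 removes all MoreComments stubs
    for comment in post.comments.list():  # .list() flattens the whole tree
        print(comment.body[:80])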

Other Functionality

Reddit has a deep comment structure, and the code above only goes two levels down (top comment and top comment reply). You can view PRAW's additional functionality, replete with examples, on its website: http://praw.readthedocs.io/


Topic 2.2: Facebook API

Getting access to Facebook's API is slightly easier than Twitter's in that you can go to the Graph API explorer, grab an access token, and immediately start playing around with the API. The access token isn't good forever though, so if you plan on doing long-term analysis or data capture, you'll need to go the full OAuth route and generate tokens using the approved paths.


In [ ]:
# As before, the first thing we do is import the Facebook
# wrapper

import facebook

Connecting to the Facebook Graph

Facebook has a "Graph API" that lets you explore its social graph. Due to privacy concerns, however, Facebook's Graph API is extremely limited in the kinds of data it can view. For instance, Graph API applications can now only view profiles of people who have already installed that particular application. These restrictions make it quite difficult to see much of Facebook's data.

That being said, Facebook does have many popular public pages (e.g., BBC World News), and articles or messages posted by these public pages are accessible. In addition, many posts and comments made in reply to these public posts are also publicly available for us to explore.

To connect to Facebook's API, though, we need an access token (unlike with Reddit's API). Fortunately, for research and testing purposes, getting an access token is very easy.

Acquiring a Facebook Access Token

  1. Log in to your Facebook account
  2. Go to Facebook's Graph Explorer (https://developers.facebook.com/tools/explorer/)
  3. Copy the long string out of the "Access Token" box and paste it into the code cell below


In [ ]:
fbAccessToken = "EAACEdEose0cBAKZAZBoGzF6ZAJBk3uSB0gXSgxPrZBJ5nsZCXkM25xZBT0GzVABvsZBOvARxRukoLxhVEyO42QO1D1IInuE1ZBgQfffxh10BC0iHJmnKfNGHn9bY6ioZA8gHTYAXoOGL0A07hZBKXxMKO1yS3ZAPDB50MVGLBxDjJJDWAYBFhUIoeaAaMAZAzxcT4lMZD"

Now we can use the Facebook Graph API with this temporary access token (note that it is short-lived and will expire after a short while, so you may need to grab a fresh one).


In [ ]:
# Connect to the graph API, note we use version 2.5
graph = facebook.GraphAPI(access_token=fbAccessToken, version='2.5')

Parsing Posts from a Public Page

To get a public page's posts, all you need is the name of the page. Then we can pull the page's feed, and for each post on that feed, we can pull its comments and the names of the commenters. While it's unlikely we can get much more user information than that, commenter names combined with sentiment or text analytics can give insight into emerging topics and demographics.


In [ ]:
# What page to look at?
targetPage = "nytimes"

# Other options for pages:
# nytimes, bbc, bbcamerica, bbcafrica, redcross, disaster

maxPosts = 10 # How many posts should we pull?
maxComments = 5 # How many comments for each post?

feed = graph.get_object(id=targetPage + '/feed')

# For each post, print its message content and its ID
# (some posts, e.g. shared photos, may lack a "message" field, so use .get())
for v in feed["data"][:maxPosts]:
    print("---")
    print(v.get("message", ""), v["id"])

    # For each comment on this post, print its number,
    # the name of the author, and the message content
    print("Comments:")
    comments = graph.get_object(id='%s/comments' % v["id"])
    for (i, comment) in enumerate(comments["data"][:maxComments]):
        print("\t", i, comment["from"]["name"], comment["message"])
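
Facebook also returns feed results in pages. Depending on your installed version of the facebook-sdk package, the get_all_connections() helper can iterate across those pages for you; the sketch below assumes that method is available and simply stops after maxPosts posts.


In [ ]:
# A sketch of paging through a public page's feed
# (assumes facebook-sdk provides get_all_connections in your installed version)
postCount = 0
for fbPost in graph.get_all_connections(id=targetPage, connection_name='feed'):
    print(fbPost.get("message", ""), fbPost["id"])

    postCount += 1
    if postCount >= maxPosts:
        break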


Topic 2.3: Twitter API

Twitter's API is probably the most useful and flexible, but it takes several steps to configure. To get access to the API, you first need to have a Twitter account and a mobile phone number (or any number that can receive text messages) attached to that account. Then, we'll use Twitter's developer portal to create an "app" that will give us the tokens and keys (essentially IDs and passwords) we will need to connect to the API.

So, in summary, the general steps are:

  1. Have a Twitter account,
  2. Configure your Twitter account with your mobile number,
  3. Create an app on Twitter's developer site, and
  4. Generate consumer and access keys and secrets.

We will then plug these four strings into the code below.


In [ ]:
# For our first piece of code, we need to import the package 
# that connects to Twitter. Tweepy is a popular and fully featured
# implementation.

import tweepy

Creating Twitter Credentials

If you need more in-depth instructions for creating a Twitter account or setting one up to use the following code, I provide a walkthrough below for configuring your account and generating the necessary credentials.

First, we assume you already have a Twitter account. If this is not true, either create one quickly or just follow along. See the attached figures.

  • Step 1. Create a Twitter account If you haven't already done this, do this now at Twitter.com.

  • Step 2. Setting your mobile number Log into Twitter and go to "Settings." From there, click "Mobile" and fill in an SMS-enabled phone number. You will be asked to confirm this number once it's set, and you'll need to do so before you can create any apps for the next step.

  • Step 3. Create an app in Twitter's Dev site Go to apps.twitter.com, and click the "Create New App" button. Fill in the "Name," "Description," and "Website" fields, leaving the callback field blank (we're not going to use it). Note that the website must be a fully qualified URL, so it should look like: http://test.url.com. Then scroll down, read the developer agreement, check that you agree, and finally click "Create your Twitter application."

  • Step 4. Generate keys and tokens with this app After your application has been created, you will see a summary page like the one below. Click "Keys and Access Tokens" to view and manage keys. Scroll down and click "Create my access token." After a moment, your page should refresh and show you four long strings of characters and numbers: a consumer key, a consumer secret, an access token, and an access secret (note these are case-sensitive!). Copy and paste these four strings into the quotes in the code cell below.


In [ ]:
# Use the strings from your Twitter app webpage to populate these four 
# variables. Be sure and put the strings BETWEEN the quotation marks
# to make it a valid Python string.

consumer_key = "IQ03DPOdXz95N3rTm2iMNE8va"
consumer_secret = "0qGHOXVSX1D1ffP7BfpIxqFalLfgVIqpecXQy9SrUVCGkJ8hmo"
access_token = "867193453159096320-6oUq9riQW8UBa6nD3davJ0SUe9MvZrZ"
access_secret = "5zMwq2DVhxBnvjabM5SU2Imkoei3AE6UtdeOQ0tzR9eNU"

Connecting to Twitter

Once we have the authentication details set, we can connect to Twitter using the Tweepy OAuth handler, as below.


In [ ]:
# Now we use the configured authentication information to connect
# to Twitter's API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

print("Connected to Twitter!")

Testing our Connection

Now that we are connected to Twitter, let's do a brief check that we can read tweets by pulling the first few tweets from our own timeline (or the account associated with your Twitter app) and printing them.


In [ ]:
# Get tweets from our timeline
public_tweets = api.home_timeline()

# print the first five authors and tweet texts
for tweet in public_tweets[:5]:
    print (tweet.author.screen_name, tweet.author.name, "said:", tweet.text)

Searching Twitter for Keywords

Now that we're connected, we can search Twitter for specific keywords with relative ease, just as if you were using Twitter's search box. While this search only goes back 7 days and/or 1,500 tweets (whichever is less), it can be powerful if an event you want to track has just started.

Note that you might have to deal with paging if you get lots of data. Twitter will only return you one page of up to 100 tweets at a time.


In [ ]:
# Our search string
queryString = "earthquake"

# Perform the search
matchingTweets = api.search(queryString)

print ("Searched for:", queryString)
print ("Number found:", len(matchingTweets))

# For each tweet that matches our query, print the author and text
print ("\nTweets:")
for tweet in matchingTweets:
    print (tweet.author.screen_name, tweet.text)

More Complex Queries

Twitter's Search API exposes many capabilities, like filtering for media, links, mentions, geolocations, dates, etc. We can access these capabilities directly with the search function.

For a list of operators Twitter supports, go here: https://dev.twitter.com/rest/public/search


In [ ]:
# Let's find only media or links about earthquakes
queryString = "earthquake (filter:media OR filter:links)"

# Perform the search
matchingTweets = api.search(queryString)

print ("Searched for:", queryString)
print ("Number found:", len(matchingTweets))

# For each tweet that matches our query, print the author and text
print ("\nTweets:")
for tweet in matchingTweets:
    print (tweet.author.screen_name, tweet.text)
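
Beyond query operators, the search endpoint also accepts request parameters such as lang, result_type, and count. The sketch below assumes your Tweepy version passes these through as keyword arguments to api.search.


In [ ]:
# A sketch: extra search parameters (argument names assume Tweepy's REST search wrapper)
recentTweets = api.search(q="earthquake -filter:retweets",
                          lang="en",             # English-language tweets only
                          result_type="recent",  # most recent rather than "popular"
                          count=100)             # up to 100 tweets per page

for tweet in recentTweets:
    print(tweet.author.screen_name, tweet.text)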

Dealing with Pages

As mentioned, Twitter serves results in pages. To get all results, we can use Tweepy's Cursor implementation, which handles this iteration through pages for us in the background.


In [ ]:
# Let's find only media or links about earthquakes
queryString = "earthquake (filter:media OR filter:links)"

# How many tweets should we fetch? Upper limit is 1,500
maxToReturn = 100

# Perform the search, and for each tweet that matches our query, 
# print the author and text
print ("\nTweets:")
for status in tweepy.Cursor(api.search, q=queryString).items(maxToReturn):
    print (status.author.screen_name, status.text)
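
If you prefer to process results page by page (for example, to checkpoint after each page), the Cursor object can also iterate over whole pages instead of individual items:


In [ ]:
# A sketch: iterate over pages of results instead of individual tweets
for page in tweepy.Cursor(api.search, q=queryString).pages(3):
    print("Got a page with", len(page), "tweets")
    for status in page:
        print("\t", status.author.screen_name, status.text[:60])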

Other Search Functionality

The Tweepy wrapper and the Twitter API are pretty extensive. You can do things like pull the last 3,200 tweets from other users' timelines, find all retweets of your account, get follower lists, search for users matching a query, etc.

More information on Tweepy's capabilities is available at its documentation page: (http://tweepy.readthedocs.io/en/v3.5.0/api.html)

Other information on the Twitter API is available here: (https://dev.twitter.com/rest/public/search).
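
As one example of that broader functionality, the sketch below pulls a few recent tweets from another account's public timeline, assuming api.user_timeline accepts the screen_name and count arguments in your Tweepy version.


In [ ]:
# A sketch: recent tweets from a specific account's public timeline
for status in api.user_timeline(screen_name="nytimes", count=5):
    print(status.created_at, status.text)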

Twitter Streaming

Up to this point, all of our work has been retrospective. An event has occurred, and we want to see how Twitter responded over some period of time.

To follow an event in real time, Twitter and Tweepy support streaming. Streaming is a bit complicated, but it essentially lets us track a set of keywords, places, or users.

To keep things simple, I will provide a simple listener class and show how to print the first few tweets it captures. Larger solutions exist specifically for handling Twitter streaming at scale.

You could take this code though and easily extend it by writing data to a file rather than the console. I've marked where that code could be inserted.


In [ ]:
# First, we need to create our own listener for the stream
# that will stop after a few tweets
class LocalStreamListener(tweepy.StreamListener):
    """A simple stream listener that breaks out after X tweets"""
    
    # Max number of tweets
    maxTweetCount = 10
    
    # Set current counter
    def __init__(self):
        tweepy.StreamListener.__init__(self)
        self.currentTweetCount = 0
        
        # For writing out to a file
        self.filePtr = None
        
    # Create a log file
    def set_log_file(self, newFile):
        if ( self.filePtr ):
            self.filePtr.close()
            
        self.filePtr = newFile
        
    # Close log file
    def close_log_file(self):
        if ( self.filePtr ):
            self.filePtr.close()
    
    # Pass data up to parent then check if we should stop
    def on_data(self, data):

        print (self.currentTweetCount)
        
        tweepy.StreamListener.on_data(self, data)
            
        if ( self.currentTweetCount >= self.maxTweetCount ):
            return False

    # Increment the number of statuses we've seen
    def on_status(self, status):
        self.currentTweetCount += 1
        
        # Could write this status to a file instead of to the console
        print (status.text)
        
        # If we have specified a file, write the raw tweet JSON to it
        if ( self.filePtr ):
            self.filePtr.write("%s\n" % json.dumps(status._json))
        
    # Error handling below here
    def on_exception(self, exc):
        print (exc)

    def on_limit(self, track):
        """Called when a limitation notice arrives"""
        print ("Limit", track)
        return

    def on_error(self, status_code):
        """Called when a non-200 status code is returned"""
        print ("Error:", status_code)
        return False

    def on_timeout(self):
        """Called when stream connection times out"""
        print ("Timeout")
        return

    def on_disconnect(self, notice):
        """Called when twitter sends a disconnect notice
        """
        print ("Disconnect:", notice)
        return

    def on_warning(self, notice):
        """Called when a disconnection warning message arrives"""
        print ("Warning:", notice)

Now we set up the stream using the listener above


In [ ]:
listener = LocalStreamListener()
localStream = tweepy.Stream(api.auth, listener)

In [ ]:
# Stream based on keywords
localStream.filter(track=['earthquake', 'disaster'])

In [ ]:
listener = LocalStreamListener()
localStream = tweepy.Stream(api.auth, listener)

# List of screen names to track
screenNames = ['bbcbreaking', 'CNews', 'bbc', 'nytimes']

# Twitter stream uses user IDs instead of names
# so we must convert
userIds = []
for sn in screenNames:
    user = api.get_user(sn)
    userIds.append(user.id_str)

# Stream based on users
localStream.filter(follow=userIds)

In [ ]:
listener = LocalStreamListener()
localStream = tweepy.Stream(api.auth, listener)

# Specify coordinates for a bounding box around area of interest
# In this case, we use San Francisco
swCornerLat = 36.8
swCornerLon = -122.75
neCornerLat = 37.8
neCornerLon = -121.75

boxArray = [swCornerLon, swCornerLat, neCornerLon, neCornerLat]

# Say we want to write these tweets to a file
listener.set_log_file(codecs.open("tweet_log.json", "w", "utf8"))

# Stream based on location
localStream.filter(locations=boxArray)

# Close the log file
listener.close_log_file()
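
Since the listener writes one JSON object per line, we can read the log back in with the json module. This sketch assumes the stream above actually captured some tweets into tweet_log.json before stopping.


In [ ]:
# A sketch: read back the logged tweets and print their text fields
with codecs.open("tweet_log.json", "r", "utf8") as logFile:
    for line in logFile:
        tweet = json.loads(line)
        print(tweet.get("text", ""))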

In [ ]: